Disaster recovery specialists often refer to a seven-tier disaster
recovery paradigm. These tiers start at the bottom of the disaster
recovery food chain with Tier 0 (no offsite data and possibly no
recovery) and progress up to the highest level of disaster recovery
possible, Tier 7 (zero to near-zero data loss with highly automated
recovery). We developed a simpler, more general five-level
representation that should help you understand DR and how to approach
it. Let’s start with where we think most small to midsize companies
find themselves: they have little to no disaster recovery plan and are
operating at a fairly high risk exposure level. In Figure 1,
this places those companies at the very bottom of the upside-down
pyramid at Level 1, and perhaps not even in the DR pyramid at all
(below the line at Level 0, with no data backed up offsite at all).
Very
likely, many of these companies are carrying far more risk than they
would knowingly accept. The good news is that getting to a more
protected level with your SQL Server–based applications isn’t really
that hard to do. But nothing is free. After you establish some type of
DR plan and create the technical process and mechanisms for your DR,
you still have to implement and test it. Now, let’s examine the
disaster recovery levels more closely so that you can decide which
level your company is at today and which level it needs to reach.
Level 0
is easy to understand: you undergo a disaster and have nothing stored
offsite from which to recover. You basically have to pick up the pieces
(after a fire, for example) and see what is salvageable from your site.
Even the best onsite backup plans are meaningless for disaster recovery
if the backups were not stored somewhere safe (offsite).
You must get to Level 1
as soon as possible if you are even remotely serious about disaster
recovery. This level is the very basic capability of creating a
recoverable image (database backups, system configuration backups, and
backups of user IDs, permissions, roles, and so on) that allows you to
rebuild your critical applications and databases that are running on
SQL Server at an alternate location. Data loss is probably involved,
but it may not be enough to put your company out of business. If you
are not doing this right now, do it today!
Level 2
adds a much shorter recovery time frame into the mix and gets closer to
a point-in-time recovery capability. Capabilities such as electronic
vaulting help greatly in restoring systems (databases) rapidly and
getting back online within a short amount of time. There is still a
certain level of data loss here. That cannot be avoided.
Level 3
moves you to the complexities of recovering transactional integrity in
your applications and minimizing your data losses. This level requires
much more effort, resources, and sophistication, but is very doable.
Finally, many larger companies have Level 4
in place to completely protect themselves from single-site failures or
disasters without missing a single order transaction. This chapter
identifies the different SQL Server–based options that allow you to
achieve Level 1 through Level 4 DR. In some cases, achieving this
recovery involves using a Microsoft product combined with a third-party
product.
The mandate we put on you is
to devise a highly efficient disaster recovery plan in support of your
business continuity needs and then test it
completely! Make sure the plan covers every aspect of coming up at an
alternate location as smoothly and quickly as possible, with as little
data loss as possible. Defining a
disaster recovery plan can be a tedious job owing to the potential
complexities of your configuration. But keep in mind that a disaster
recovery plan is essential to your company’s very existence and should
be treated as such. Your objective is to move up the pyramid (to Level
4 or near it) to match your company’s needs for business continuity. We
don’t want to read in the trades how your company took weeks to recover
from a data center disaster and eventually went out of business;
industry statistics are against you if you haven’t prepared.
Disaster Recovery Patterns
In general, you should consider three main DR patterns when trying to achieve Level 1 through Level 4 DR.
Active/Passive DR Sites Pattern
Figure 2
illustrates the active/passive DR sites pattern. This typical disaster
recovery configuration is probably the most common in the world (for
those who are doing DR). It involves a primary site (the normal
environments in which you do business day in and day out) and a passive
DR site. This passive DR (alternative) site can be anything and
anywhere. It could be a “hot” alternate site that is ready for you to
apply a current database backup to (and application image, too), or it
could be a “cold” alternate site that you have to either go out and
lease, co-locate with, or build completely from scratch. The available
resources, money, and business need determine which method you choose.
Obviously, the colder your DR site, the longer it will take for you to
recover using it. A typical estimate of the time needed for a small to
midsized company to completely rebuild the essential systems on an
alternate site (DR site) is between 23 and 31 days.
The Microsoft options that
help you achieve this DR pattern start with database backups that are
taken offsite and are readily recallable to recover (restore) your
database tier. If you have a “hot” DR
site available, you can also use data replication to the DR site, log
shipping, or even asynchronous database mirroring. Some third-party
products, such as Symantec’s Veritas Volume Replicator, push physical
byte-level changes to the passive (hot) DR site at the physical tier. In
most of these options, the DR site is passive, literally sitting there
idle until it is needed. The only exception to that rule is when you
are using Microsoft data replication or database snapshots with
database mirroring. Even in those cases, the DR site does not support
transactional change (updates, deletes, inserts); it supports read
access only. Also, you need to remember that log shipping is on the way
out in future Microsoft releases, so don’t plan too much new usage of
that feature.
Active/Active DR Sites Pattern
An active/active DR
configuration essentially provides two primary sites that can process
transactions equally. Any particular external or internal usage of the
applications doesn’t really know which site is processing the requests
(and, in fact, doesn’t need to know). This configuration has the entire
application stack completely deployed on both sites. The data is kept
in sync in real time, and either site could be lost without affecting
the operation of the company. Some data loss is possible, but this is
usually kept at a minimum, depending on which Microsoft options you are
using. Figure 3 shows an active/active DR configuration that bidirectionally keeps the databases in sync.
A
few things may cause issues for this pattern. For one, you need to make
sure that no application keeps “state” from one transaction to the
next. Additionally, the application and/or the web tier needs to be
able to route user connections (the load) to either site in a balanced
or round-robin fashion. This is often done with big IP
routers that use round-robin routing algorithms, for example, to
determine which site to direct connections to. Active/active
configurations can be created using peer-to-peer continuous data
replication as well as other multi-updating subscriber replication
topologies. A slight twist to having two primary sites is to have one
primary site and a secondary site that doesn’t process transactions but
is actively used for reporting, testing, and other tasks (just no
processing that changes anything). In the event of a primary failure,
the secondary site can take over full primary site responsibilities
quickly. This is sort of active/passive, with active “secondary usage”
on the passive site (following the first active/passive DR pattern
described previously). This type of configuration can take advantage of
database mirroring and database snapshots (for the reporting). There
are plenty of advantages to this variation, which greatly distributes
the workload and moves up the DR pyramid.
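To make the routing requirement concrete, here is a minimal Python sketch of round-robin routing across two active sites, with a failed site skipped. It is an illustration only, not how any particular router works: the class name RoundRobinRouter, its methods, and the site names are all hypothetical.

```python
class RoundRobinRouter:
    """Hypothetical connection router: hands out sites in rotation."""

    def __init__(self, sites):
        if not sites:
            raise ValueError("at least one site is required")
        self._sites = list(sites)
        self._down = set()   # sites currently considered lost
        self._next = 0       # index of the next site to try

    def mark_down(self, site):
        self._down.add(site)

    def mark_up(self, site):
        self._down.discard(site)

    def route(self):
        """Return the next available site, skipping any site marked down."""
        for _ in range(len(self._sites)):
            site = self._sites[self._next]
            self._next = (self._next + 1) % len(self._sites)
            if site not in self._down:
                return site
        raise RuntimeError("no site is available")


router = RoundRobinRouter(["site_1", "site_2"])
before_failure = [router.route() for _ in range(4)]

# Simulate losing one active site; the survivor takes all connections.
router.mark_down("site_1")
after_failure = [router.route() for _ in range(3)]
```

Because either site may receive the next connection, this only works if the applications are stateless in the sense described above.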
Active Multisite DR Pattern
An active multisite DR
configuration contains three or more active sites, with the intention
of using any one of them as the DR site for another (as shown in Figure 4).
This pattern allows you to distribute your applications redundantly
between any pair of sites, but not to all three (or more). For
instance, you could have half of Primary Site 1’s applications on
Primary Site 2 and the other half on Primary Site 3. This way, you
spread out the risk further and increase your odds of uninterrupted
processing.
Again, having
“stateless” applications is critical here, as is some smart routing of
all connections to the right sites. Using continuous data replication
and the database mirroring options allows you to easily create such a
DR topology. And, again, you also have the secondary usage variation
available to you if one or more alternative sites are passive (with
secondary usage supporting reporting, for example).
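The idea of spreading each site’s applications across the other sites can be sketched in a few lines of Python. This is a minimal sketch under simple assumptions: the function assign_dr_sites, the site names, and the application names are hypothetical, and a real plan would also weigh capacity, licensing, and network distance.

```python
def assign_dr_sites(apps_by_site):
    """Map each application to (home site, DR site).

    Each site's applications are dealt out in rotation across the
    other sites, so no single site carries all of another site's risk.
    """
    sites = list(apps_by_site)
    if len(sites) < 3:
        raise ValueError("the multisite pattern needs three or more sites")
    plan = {}
    for home in sites:
        others = [s for s in sites if s != home]
        for i, app in enumerate(apps_by_site[home]):
            plan[app] = (home, others[i % len(others)])
    return plan


# Hypothetical inventory: site_1 hosts four applications.
inventory = {
    "site_1": ["orders", "invoicing", "crm", "hr"],
    "site_2": ["billing"],
    "site_3": ["reporting"],
}
dr_plan = assign_dr_sites(inventory)
```

With this inventory, half of site_1’s applications are protected by site_2 and the other half by site_3, which mirrors the example given above.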
Choosing a Disaster Recovery Pattern
We reduce these to
patterns because, at the foundational level, they represent what you
need to do to support the level of business continuity your company
demands. Some companies can tolerate different levels of loss because
of the nature of their business; others cannot. At the highest levels,
it is fairly easy to match these patterns to what your business
requires. In this chapter, we look at what SQL Server capabilities are
available to help you implement these patterns.
Often,
global companies devise a DR configuration that reserves each major
data center site in their regions as the active or passive DR site for
another region. Figure 5
shows one large high-tech company’s global data center locations. Its
Alexandria, Virginia, site is also the passive DR site for its Phoenix,
Arizona, site. Its Paris, France, regional site is also the DR site for
its Alexandria, Virginia, site, and so on.
For companies that have
multiple data center sites but only need to support the active/passive
DR pattern, a very popular variation can be used. This variation is
called reciprocal DR. As you can see in Figure 6,
there are two sites (Site 1 and Site 2). Each is active for some
applications (Applications 1, 3, and 5 on Site 1 and Applications 2, 4,
and 6 on Site 2). Site 1’s applications are passively supported on Site
2, and Site 2’s applications are passively supported on Site 1. Rolling
out the configuration this way eliminates the “stateless” application
issue completely and is fairly easy to implement. It is also possible
to make the passive applications’ data available via database
snapshots at the other reciprocal site (for free!), which further
distributes the workload geographically.
This configuration also
spreads out the risk of losing all applications if one site ever
happens to be lost (as in a disaster). Again, the Microsoft products to
help you achieve this DR pattern variation are data replication to the
DR site, or log shipping, and even asynchronous database mirroring with
database snapshots available to help with some distributed reporting.
As we noted previously, third-party products such as Symantec’s Veritas
Volume Replicator can be used to push physical byte-level changes to
the passive (hot) DR site physical tier level.
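The reciprocal DR layout can be expressed as a small lookup table. The following Python sketch is illustrative only; the function names reciprocal_plan and after_site_loss and the identifiers such as app_1 and site_1 are hypothetical stand-ins for Applications 1 through 6 and Sites 1 and 2.

```python
def reciprocal_plan(active_apps):
    """Build {app: {"active": site, "passive": other site}} for two sites."""
    if len(active_apps) != 2:
        raise ValueError("reciprocal DR is defined here for exactly two sites")
    site_a, site_b = active_apps
    plan = {}
    for home, apps in active_apps.items():
        partner = site_b if home == site_a else site_a
        for app in apps:
            plan[app] = {"active": home, "passive": partner}
    return plan


def after_site_loss(plan, lost_site):
    """Return where each application runs once lost_site is gone."""
    return {
        app: (roles["passive"] if roles["active"] == lost_site else roles["active"])
        for app, roles in plan.items()
    }


plan = reciprocal_plan({
    "site_1": ["app_1", "app_3", "app_5"],
    "site_2": ["app_2", "app_4", "app_6"],
})
running_on = after_site_loss(plan, "site_1")
```

If Site 1 is lost, only half of the applications need to fail over; the other half were already active on Site 2.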
Recovery Objectives
You need to understand
two main recovery objectives: the point in time to which data must be
restored to be able to successfully resume processing (called the recovery point objective) and the acceptable amount of downtime that is tolerable (called the recovery time objective). The recovery point objective (RPO) is often thought of as the time between the last
backup and the point when the outage occurred. It indicates the amount
of data that will be lost. The recovery time objective (RTO) is
determined based on the acceptable downtime in case of a disruption of
operations. It indicates the latest point in time at which the business
operations must resume after disaster (that is, how much time can
elapse).
The RPO and RTO form the
basis on which a data protection strategy is developed. This helps to
provide a picture of the total time that a business may lose due to a
disaster. The two of them together are very important requirements when
designing a solution. Let’s put these terms in the form of formulas:
RTO = Difference between the time of the disaster and the time the system is operational = Time operational (up) - Time disaster occurred (down)
RPO = Time since the last backup of complete transactions, representing data that must be re-acquired or re-entered = Time disaster occurred (down) - Time of last usable data backup
Therefore:
Total lost business time = RTO + RPO = Time operational (up) - Time of the last usable data backup
Knowing your RPO and RTO requirements is essential in determining what DR pattern to use and what Microsoft options to utilize.
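As a worked example of these definitions, the following Python sketch computes the data-loss window (the time between the last usable backup and the outage), the downtime (the time between the outage and resumed operations), and their sum for one incident. The function name recovery_metrics and the timestamps are hypothetical; in practice you would compare figures like these against the RPO and RTO your business has agreed to.

```python
from datetime import datetime, timedelta


def recovery_metrics(last_backup, disaster, operational):
    """Return (data_loss_window, downtime, total_lost) as timedeltas."""
    if not (last_backup <= disaster <= operational):
        raise ValueError("expected last_backup <= disaster <= operational")
    data_loss_window = disaster - last_backup   # measured against the RPO
    downtime = operational - disaster           # measured against the RTO
    total_lost = operational - last_backup      # sum of the two
    return data_loss_window, downtime, total_lost


# Hypothetical incident: nightly backup at 02:00, outage at 14:30,
# systems back up at 08:30 the next morning.
data_loss, downtime, total = recovery_metrics(
    last_backup=datetime(2024, 1, 10, 2, 0),
    disaster=datetime(2024, 1, 10, 14, 30),
    operational=datetime(2024, 1, 11, 8, 30),
)
print(f"data loss window: {data_loss}, downtime: {downtime}, total: {total}")
```

In this made-up incident, 12.5 hours of transactions must be re-acquired or re-entered and the systems are down for 18 hours, so the business loses 30.5 hours in total.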
A Data-Centric Approach to Disaster Recovery
Disaster recovery is
a complex undertaking unto itself. However, it isn’t really necessary
to recover every system or application in the event of a disaster.
You must set priorities to determine exactly which systems or
applications must be recovered. These are typically the
revenue-generating applications (such as order entry, order
fulfillment, and invoicing) that your business relies on to do basic
business with its customers. Therefore, you set the highest priorities
for DR with those revenue-generating systems. Then the next level of
recovery is for the second-priority applications (such as HR systems).
After you prioritize which
applications should be part of your DR plans, you need to fully
understand what must be included in recovery to ensure that these
priority applications are fully functional. The best way is to take a
data-centric approach, which focuses on what data is needed to bring up
the application. Data comes in many flavors, as Figure 7 shows:
Metadata— The data that describes structures, files, XSDs, and so on that the applications, middleware, or back end needs.
Configuration data— The data that the application needs to define what it must do, or the middleware needs to execute with, and so on.
Application data values— The data itself within your database files that represents the transactional data in your systems.
As just mentioned, you
first identify which applications you must include in your DR plans,
and then you must make sure you back up and are able to recover that
application’s data
(metadata, configuration data, and application data). As part of this
exercise, you must determine how tightly or loosely coupled the data is
to other applications. In other words (as you can also see in Figure 7),
if on the back-end tier, Database A has the orders transactions and
Database B has the invoicing data, both must be included in the DR
plans (because they are tightly coupled). In addition, you must also
know how tightly or loosely coupled the application stack components
are across the tiers. In other words (again looking at Figure 7),
if the ERP application (in the application tier) requires some type of
middleware to be present to handle all its messaging, that middleware
tier component is tightly coupled with the ERP application and must be
included in the DR plan as well.
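One way to reason about coupling is to treat it as a graph and collect everything reachable from the priority applications. The Python sketch below is a minimal illustration, assuming you have already recorded which components each component is tightly coupled to; the function name recovery_set and component names such as messaging_middleware and database_c are hypothetical.

```python
from collections import deque


def recovery_set(priority_apps, tightly_coupled):
    """Return every component that must be recovered with the priority apps.

    tightly_coupled maps a component to the components it cannot run
    without; the search follows those links transitively.
    """
    needed = set()
    queue = deque(priority_apps)
    while queue:
        component = queue.popleft()
        if component in needed:
            continue
        needed.add(component)
        queue.extend(tightly_coupled.get(component, ()))
    return needed


# Hypothetical coupling map based on the ERP example in the text.
coupling = {
    "erp_app": ["messaging_middleware", "database_a"],
    "database_a": ["database_b"],   # orders data needs invoicing data
    "hr_app": ["database_c"],       # second priority, not pulled in here
}
must_recover = recovery_set(["erp_app"], coupling)
```

Here the ERP application pulls in its middleware and both databases, while the HR system stays out of the first-priority recovery set.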